Extracting secondary data from citizen science images reveals host flower preferences of the Mexican grass‐carrying wasp Isodontia mexicana in its native and introduced ranges

Abstract We investigated the plant‐pollinator interactions of the Mexican grass‐carrying wasp Isodontia mexicana—native to North America and introduced in Europe in the 1960s—through the use of secondary data from citizen science observations. We applied a novel data exchange workflow from two global citizen science platforms, iNaturalist and Pl@ntNet. Images from iNaturalist of the wasp were used to query the Pl@ntNet application to identify possible plant species present in the pictures. Simultaneously, botanists manually identified the plants at family, genus and species levels and additionally documented flower color and biotic interactions. The goals were to calibrate Pl@ntNet's accuracy in relation to this workflow, update the list of plant species that I. mexicana visits as well as its flower color preferences in its native and introduced ranges. In addition, we investigated the types and corresponding frequencies of other biotic interactions incidentally captured on the citizen scientists' images. Although the list of known host plants could be expanded, identifying the flora from images that predominantly show an insect proved difficult for both experts and the Pl@ntNet app. The workflow performs with a 75% probability of correct identification of the plant at the species level from a score of 0.8, and with over 90% chance of correct family and genus identification from a score of 0.5. Although the number of images above these scores may be limited due to the flower parts present on the pictures, our approach can help to get an overview into species interactions and generate more specific research questions. It could be used as a triaging method to select images for further investigation. Additionally, the manual analysis of the images has shown that the information they contain offers great potential for learning more about the ecology of an introduced species in its new range.


| INTRODUC TI ON
Biological invasions are a major cause of biodiversity loss, but can also have negative impacts on the economy and human wellbeing (IPBES, 2019).Environmental impacts of alien species are manifested through a variety of mechanisms, but most refer to direct interactions of the invading species with other species, such as predation, herbivory and parasitism (Blackburn et al., 2014;IUCN, 2020).By the time negative impacts of an introduced species are noticed, the species is often too well established to eradicate it, or even manage it effectively (Simberloff et al., 2013).It is obviously important to detect an invasion as early as possible, and to gather information on the impact of a potentially invasive species in the area of interest, so that management options can be effectively prioritized.
We, therefore, need to increase the spatial and temporal coverage of data on biotic interactions as well as the speed of the data collection process, while maintaining a high quality of the data.One of the solutions is to expand data collection by involving the public (Pocock et al., 2018), as has increasingly been done in the last decade in the context of citizen science (Callaghan et al., 2019;Feldman et al., 2021;Pocock et al., 2017).People upload their nature observations supported by multimedia evidence, such as images, on citizen science platforms for biodiversity inventory and monitoring such as iNaturalist, Pl@ntNet and obser vation.org.These species records contribute to a large proportion of open biodiversity data, about 50% of the data hosted by the Global Biodiversity Information Facility in 2019(gbif. org, Waller, 2019)).Now judged to be "mainstream" (Callaghan et al., 2019), citizen science has proved its value to track, surveille and manage biological invasions in different forms (Encarnação et al., 2021).For example, various programs are in operation as early warning systems, e.g., for the spread of invasive mosquito species or plant diseases (Brown et al., 2017;Palmer et al., 2017).Moreover, to build support for management, citizen science is also useful, e.g., when the public helps with the elimination or control of invasive species (Miralles et al., 2016).But beside detection and management, citizen science still holds further untapped potential for invasion biology and ecology in general.Specifically, when it comes to monitoring biotic interactions of an introduced species in its new range, the multimedia voucher of the species record (i.e., image, sound, video, text) can be of additional value.
Researchers are not only using the primary data (i.e., the observed species together with its observation date and location), but the additional information captured with the citizen observationsthe so-called secondary data (Callaghan et al., 2021).Secondary data comprise the information that can be extracted from the observation evidence itself, as a by-product of the record of a species at a particular time and place.The additional information may include details of the observed species' ecological interactions, morphology, behavior, habitat and various other aspects of its traits and ecology (Pernat et al., 2024).
In this paper, we focus on the secondary data that can be extracted from iNaturalist observations using two different approaches (manually and applying an automated workflow), particularly aiming to increase knowledge on biotic interactions.We selected the Mexican grass-carrying wasp, Isodontia mexicana (Saussure, 1867), Sphecidae, as a case study, and compared its flower preferences in its native and introduced ranges.Such exercises are useful in the assessment of the potential impact of an alien species in an introduced area.The two-centimeter-long wasp, nests in small cavities, including bamboo or reed, whose several breeding chambers are lined with plant material and closed with a tuft of grass, the blades of which the female brings one at a time-hence the vernacular name.With North America as its native range, the wasp was introduced in Europe in the middle of the last century with first observations in the South of France (Kelner-Pillaut, 1962).Since then it has been spreading successfully across Europe (Schirmel et al., 2020) and has been detected in at least 16 European countries so far (Bosch et al., 2018;Burton et al., 2019).This wasp is ideal to investigate plant-insect interactions for several reasons.First, it has a characteristic and striking appearance and feeds on nectar, which is why it is frequently recorded and mostly on plants (Figure 1).Second, little is known about its potential ecological impacts in Europe (Turrisi, 2020).Third, iNaturalist observations include records from its introduced and native range, which makes a comparison between both ranges possible.Fourth, because of the wasps' behavior of carrying grass leaves and prey for their larvae (i.e., bush-crickets and crickets) into their nests, other types of interactions may be recorded by citizen scientists in addition to plant visitations that can be extracted as secondary data.
In this study, we investigated the possibility of inferring information about interactions between plants and I. mexicana from iNaturalist images.iNaturalist was launched in 2008 by the California Academy of Sciences (iNaturalist, 2022) and has become an independent non-profit organization for recording and sharing nature's observations worldwide, mostly evidenced through photographs of the organism.It is probably the single largest source of open insect observations globally.These observations are primarily used to identify the species, date, and place, and thus provide scientists with research quality data to understand and protect biodiversity.In contrast to other studies exploiting secondary data, we not only used manually extracted information on plants and interactions from these images (e.g., plant species, flower color and type of interaction), but also applied a novel data exchange approach.
We evaluated the applicability of identifying the host plants of I. mexicana by means of the Pl@ntNet API, meaning, we tested whether it is possible to let Pl@ntNet identify the plants on the iNaturalist I. mexicana image records.Pl@ntNet, akin to iNaturalist in functionality, focuses on the domain of plant identification using computer vision (Bonnet et al., 2020).Developed in France in 2009, it is one of the pioneering mobile applications exclusively dedicated to discerning plant species (Pl@ntNet, 2022).In this paper we show how iNaturalist images and the Pl@ntNet image classifier can be combined to extract secondary data on flower visitation from citizen science observations.Besides demonstrating the feasibility, our study had the following content-related and methodological goals (i) to assess the accuracy of Pl@ntNet identifications by establishing confidence score thresholds above which identifications agree with botanical experts at species, genus, and family levels.
(ii) to identify possible host plant and flower color preferences of I. mexicana in its native and introduced range by means of a secondary data approach and to compare the results with existing literature.
(iii) to identify other biotic interactions captured in the images.

| The iNaturalist-Pl@ntNet automated data exchange workflow
Observations of I. mexicana were retrieved from the iNaturalist platform on March 6th, 2022, via the rinat package (Barve & Hart, 2021).
Only research grade observations, those where community consensus has been reached, were used.This resulted in a dataset with 1741 images (see Figure 1 for an example image).Via the image URLs of the iNaturalist pictures, potential plant species displayed on the images as secondary data were identified by Pl@ntNet, applying the command identify() from the corresponding plantnet package (August, 2019).For two iNaturalist image URLs, Pl@ntNet returned Species not found as results, so that the final dataset comprised 1739 images.
We added the columns of suggested species families and genera to the dataset by applying the packages stringr (Wickham, 2022) and the command plantminer() from the taxize package (Chamberlain & Szocs, 2013).Automatically, Pl@ntNet provides a list of plant species suggestions with a corresponding confidence score from 0 to 1 for each candidate that indicates their relative likelihoods.The higher the confidence score, the greater the likelihood that the PlantNet classification model has predicted the correct plant species.

| Manual plant identification and validation
To verify the plant species candidates suggested by Pl@ntNet, botanical experts (DM and QG) examined the iNaturalist images of I. mexicana, identified plant families and, when possible, genera and species.In some cases, experts categorized plants in images as unidentifiable (e.g., insufficient discernable traits for identification) or, if no flower or plant was visible at all, as noplant.In addition to the dataset generated by the automated workflow, the color of the flower and the type of interaction was noted.The color of the flower was extracted by only one author (DM) using self-selected color categories to avoid inconsistencies in rating.In addition to flower color, biotic interactions other than the pollinator-plant relationship may be randomly detected in the images.Images were visually inspected for other biotic interactions and when discovered, also assigned a category by DM.Besides the wasp feeding on nectar or visiting a flower (both categorized as visiting), the insect was photographed with prey (preying), resting (resting), interacting with humans (e.g., sitting on fingers, human), or in other interactions such as mating, dead (also as museum collections) or preyed upon (other).
The identifications of plant species from the iNaturalists images were done blind, meaning, the Pl@ntNet suggestions were not provided to the experts, in order to avoid bias.Of the 1739 images, the botanists identified 36.3% plants at species level, 60.3% at genus level and 69.2% at family level (Table 1).No plant was visible on 194 images, so no determination was possible.

TA B L E 1
Proportion of identifiable plant species, genera and families from iNaturalist images by experts (unidentifiable = plant/ flower features not sufficient for identification or beyond the geographical knowledge of the experts).

| Statistical analysis
For comparison, the expert dataset was merged with the highest scoring species candidate suggested by Pl@ntNet for each image, so that the comparison was made between the expert identification as benchmark and the most probable (highest scored) plant species suggested by Pl@ntNet.Descriptive analysis, chi-square and figures were created with R-4.1.0(R Core Team, 2022), applying package tidyverse (Wickham et al., 2019), treemapify (Wilkins, 2021), and viridis (Garnier et al., 2021).Range labelsnative North America and introduced Europe-were assigned by means of the packages sf (Pebesma, 2018) and rworldmap (South, 2011).Two observations were lacking geo-references, around 100 observations needed to be assigned to a continent manually, and two observations were excluded after visual inspection confirmed they originated from Hawaii, and Trinidad and Tobago, respectively.The exclusion was based on the focus of the study on continental North America and the differences in flora and invasion processes compared to islands.As a result, the spatial analysis was conducted with 1735 observations, 577 from Europe and 1158 from North America (Figure 2).Significant differences in frequencies of colors of host flowers were investigated with a Chi-square test.

| Accuracy of species identification by Pl@ntNet
To obtain a better assessment of Pl@ntNet's performance and accuracy thresholds, we compared the results from the expert identification with the first suggestion (highest score) for each image (mean score = 0.21, median score = 0.15).Table 2 shows the number of images and percentage of matches between Pl@ntNet and the expert identification for different confidence score thresholds, considering only the observations that were identified by the experts (i.e., did not fall into either the unidentifiable or the noplant category).
The proportion of matches, that is, the agreement in plant genera and families of expert and app identifications, was over 90% from a confidence score of 0.5, while at the species level only 75% of the identifications matched at a score higher than 0.8 (Table 2).candidate suggestion with a score higher than 0.5, grouped by native (North America) and introduced range (Europe).To update the plant species list I. mexicana interacts with in its introduced region, we tabled the results from the experts verifying the iNaturalist images manually and indicated for which species there was an agreement with Pl@ntNet (Appendix).

| Flower color and other types of interactions recorded
The most frequently photographed flower color associated with I. mexicana was white, followed by yellow, and then, by a wide margin, purple, blue, and pink.On both continents white accounted for more than 40% of flower colors but the frequency of colors was significantly different (χ 2 = 69.996,df = 24, p < .001).Variability in other hues was greater in Europe than in North America, where yellow accounted for another third of flowers (Figure 4).
Feeding or sitting on flowers (visiting) was also the most common interaction in which the wasp was photographed, but resting on non-flowering plants was also frequently recorded (Figure 5).

| DISCUSS ION
Isodontia mexicana is a large and conspicuous insect with a fascinating behavior and is, therefore, frequently recorded on iNaturalist.The wasp is usually photographed sitting on plants and it depends on the photo taking behavior of citizen scientists, which parts of the plant are visible and if these parts suffice for identification.In fact, the plantnet R package had to be updated for our study to prevent Pl@ntNet from rejecting images (about 25%) on which the insect appears too prominently.Rejecting images is a safety feature of Pl@ntNet to avoid images that might violate the privacy of participants (personal communication).For macro images and in many other cases (e.g., plants with no flower, no plant at all), Pl@ntNet provided long lists of suggestions with relatively low scores, and we assume that any agreement of species identification with the experts' assessment at scores lower than 0.3 was likely a coincidence.
However, by reducing the dataset to the first suggestions of Pl@ ntNet (candidates with the highest score), we were able to estimate thresholds for the accuracy of Pl@ntNet identifications using the expert identifications as benchmark.For our case study of obtaining taxonomic rank information on pollinator-plant interactions, the threshold for a 75% correct plant species identification lies above a confidence score of 0.8 and for a more than 90% correct genus and family identification above a confidences score of 0.5.However, only a small number of the images reached values above 0.8 and 0.5 (2.1% and 8.9% of the analyzed image URLs, respectively), but once the programming was done the whole procedure only took a couple of minutes.The experts identified over a third of the plants in the photos at species level (36.6% of the total analyzed image URLs), but the process is more time-consuming.In addition, more than half of the images could not be used for accurate identification of the plants because they did not display enough distinct plant features for the experts to identify at species level with high confidence.
Consequently, the automatic approach is worthwhile for the explorative analysis of a large number of images within hours, that, for example, experts are capable of manually processing only with a considerable investment of time.Moreover, the use of Pl@ntNet can be practical when working on a broad geographic scale, such as the continents, as even the best botanists do not necessarily have identification skills that span the globe.The results show that determining TA B L E 2 Total and percentage of agreement between Pl@ntNet and expert species identification at different confidence score thresholds up-and downward.The higher the score, the more likely it is that the model has predicted the correct plant (species).Therefore, with a confidence score of 0.8, the PlantNet identification is much more likely to be accurate than with a score of 0.5.The observations (n) are listed as total observations/observations with expert identification for species, genus and family level.Each percentage value relates to the total observations identified by experts for every level.
F I G U R E 3 Matching plant species (a), genera (b) and families (c) from expert and Pl@ntNet identification, with scores higher than 0.5.
species level interactions between plants and pollinators using iNaturalist images is challenging for both computers and humans.
Therefore, it would be best if this information was provided by the iNaturalist user along with the observation or during community consensus identification.iNaturalist already provides several ways to note interactions in observations, which have already been included in research (Kirchhoff et al., 2021;Maritz & Maritz, 2020).
However, we recommend this feature is improved with the use of standard controlled vocabularies and clear guidance on how these fields should be filled in or used.
At the plant genus and family level, Pl@ntNet proved to be reliable even at lower confidence scores.This opens up the possibility of quickly an overview of host plant richness and frequencies and thus making initial statements about potential insect preferences at these levels.In this way, ecological insights can be gained not only into insects for which little is known concerning their pollination behavior (e.g., mosquitoes and other Diptera species) or that have been recently introduced and for which data on their interactions in their introduced range are lacking.Also, studies that already use iNaturalist occurrence data to model habitat suitability for invasive and native species (Beninde et al., 2023;Dart et al., 2022;Serniak et al., 2023) could refine their models and their predictive power by including secondary data information such as preferred host plant communities and other biotic interactions (Cosentino et al., 2023).For example, Halsch et al. (2020) showed that besides abiotic factors, like minimum winter temperature, the availability of the ornamental host plant, Passiflora ssp., in its introduced region of the USA limits the distribution of the gulf fritillary butterfly (Agraulis vanillae).
Looking at the species on which the experts and App agree, the plants most frequently visited by I. mexicana in its native range, according to iNaturalist observations, are species of the genus Pycnanthemum, particularly Pycnanthemum tenuifolium, the narrowleaved mountain mint.This genus is known to provide rich nectar sources for pollinators (Mader, 2011).It is native to North America and, therefore, does not appear on citizen science images taken in Europe.In Europe, for the genus Eryngium, the species Eryngium giganteum, Eryngium planum, and Eryngium campestre are the most common matches of expert and App, as is also the case for Cirsium arvense.Of the Eryngium species, Eryngium giganteum is not native to most European countries, but is native to the steppes of the Middle East and is often planted in parks and gardens as an ornamental.
Eryngium attracts a wide range of insects (von der Dunk, 2021) and represents the most species-rich genus in the Apiaceae family.Although some Eryngium species are native to North America, there were matches only for the introduced Eryngium planum.These results may be the first indication of how a missing plant genus in an introduced range, here Pycnanthemum in North America, is replaced in Europe as a source for nectar by a native genus, namely Eryngium.
Based on our dataset, the plant families also appear to show very clear preferences of I. mexicana, with Lamiaceae being the most often visited in North America due to the frequency of Pycnanthemum.However, Asteraceae, Fabaceae and Apiaceae are also commonly visited in the New World, while in Europe only Asteraceae and Apiaceae show higher frequencies.Interestingly, all families except Clethraceae and Araliaceae are distributed worldwide, although according to our analysis most families are visited either in North America or in Europe.Here the limits of interpretation become obvious: the data are probably biased by the absolute abundances of the species of the different plant families, their respective species richness and also probably by their occurrence in proximity to humans, for example, in parks, gardens or recreational areas, or recorder taxonomic preferences (Boakes et al., 2016;Pernat et al., 2021).Therefore, as with the use of citizen science data in other contexts, one has to be careful about interpreting frequencies and proportions.However, this also clearly shows how quickly and on what large scales descriptive information about biotic interactions can be obtained via our data exchange approach that cannot even be collected through decades of fieldwork (see Appendix).The results of plant-pollinator interactions are largely consistent with what is already known from the literature, but other plants have been found, which I. mexicana feeds on or visits, such as Smyrnium perfoliatum, Falcaria vulgaris or Euphorbia glareosa.
The distribution of flower color derived from iNaturalist images shows that I. mexicana predominantly visits white and yellow flowers in both North America and Europe.The significant difference in flower color may be due to higher variability in Europe, where purple and blue flowers also reach proportions greater than 5%.
Looking only at the frequencies of plant families on which App and experts agreed, one might infer from field experience that Apiaceae and Asteraceae frequently have white or yellow flowers, while Lamiaceae species frequently flower white, or purple and pink.But studies looking at the frequency of flower colors in these families and possible geographic differences seem to be lacking on this large spatial scale to confirm the anecdotal impression.Our approach shows the feasibility and potential to retrieve flower color information from citizen science data.To our knowledge, only the study of Catron et al. (2023) utilized citizen science data to learn about color preferences of two soldier beetle species across the USA.However, it is slow to do manually, and with thousands of images, there is a need for a method to automatically identify flower color from the images.Recent publications using, for example, k-means clustering, to extract flower colors, give hope for automation in the near future (Gibert et al., 2022;Perez-Udell et al., 2023).
Similarly promising to extracting flower coloration-and similarly time-consuming if done manually-is classifying and analyzing other biotic interactions captured in the images by iNaturalist observers.However, this study indicated that the PlantNet classifier would be suitable for filtering out images that do not show plants by applying a lower confidence threshold.In this way, images could be pre-selected and then visually inspected in a hybrid intelligence approach (Rafner et al., 2022).The dataset contained a few but fascinating photos of the wasp, which lives up to its name by carrying or biting off grass leaves and transporting grasshoppers and crickets in its clutches to nests.Images from citizen science projects, as well as from social media, are useful for learning more about, for example, natural enemies or the diet of target species over a larger geographic area (Maritz & Maritz, 2020;Panter & Amar, 2021).
For I. mexicana, preyed bush-cricket and cricket species have only been surveyed in country-specific field studies (Amiet, 2009;Bitsch, 2010;Scaramozzino & Currado, 1988;Tussac & Voisin, 1989;Westrich, 1998) where, for example, in Germany the entry of species into artificial nesting aids was evaluated (Schirmel et al., 2020).Here, despite the small proportion of 52 images (3.2%), a closer look at the preyed orthopterans or other species would be worthwhile to expand or confirm the state of knowledge.
In addition to the above-mentioned caveats regarding the accuracy of identification and the restricted number interpretable images, there are some other specific and general limitations of this workflow.Pl@ntNet only allows a limited number of images for identification per day per user, so a comprehensive analysis can only take place in coordination with the providers or over a longer period of time.When interpreting the frequencies of plant species, genera and families visited, it should be noted that the Pl@ntNet model may not be able to identify the different taxonomic groups equally well.This would distort the results, making it difficult to speak of pollinator preferences.
There are some general limitations when working with second- ary data from open sources such as citizen science platforms or social media, that also apply here, for example data bias.Which plants the citizen scientists prefer to photograph may be a factor that influences the frequency of plant species as well as the flower colors.
Lastly, using secondary data could raise ethical concerns, especially regarding privacy, and consent (Di Minin et al., 2021), which the user needs to be aware of, particularly if the observer becomes the subject of the research.

| CON CLUS IONS
We demonstrate that our approach and Pl@ntNet can be a valuable and efficient tool for genus and family level investigations of insect flower visitations from citizen science images.For high confidence scores, even species identification with Pl@ntNet is worthwhile, given that the results could be reviewed by experts in a non-blind hybrid intelligence approach, which reduces the workload significantly.

Figure 3
Figure3illustrates for which plant species, genera and families the identification matched between expert and Pl@ntNet's first
Secondary data can also be used as complementary data for field experiments to supplement them or to validate insights by up-scaling.Many areas of application are imaginable, but the potential of secondary data must now be utilized and tested by the scientific community.Table of plant species visited by Isodontia mexicana based on the expert identifications and grouped by region, also showing the frequency of the identified plants (n identification ) and number of matches with Pl@ntNet not considering any score threshold (n matches ).
cities with a high proportion of introduced plants, particularly in urban novel ecosystems-even in international approaches.Beyond invasion biology, phenological studies are conceivable, such as the shift of food plants over the activity period of a species.Besides typical pollinators such as butterflies, bees or hoverflies, food plants of herbivorous insects from the Orthoptera or Hemiptera groups could also be investigated.